Breast cancer is one of the leading causes of death among women around the
globe. Therefore, high accuracy in cancer prediction models is vital to improving
patients' treatment quality and survival rate. In this work, we present a new
method, the improved balancing particle swarm optimization (IBPSO) algorithm,
to predict the stage of breast cancer using unbalanced Surveillance, Epidemiology, and
End Results (USEER) data. The work contributes in two directions. First, we design and
implement an improved particle swarm optimization (IPSO) algorithm that avoids
local minima while reducing the dimensionality of the USEER data. The improvement comes
primarily from employing the cross-over ability of the genetic algorithm as a fitness
function while using the correlation-based function to guide the selection task toward a
minimal feature subset of USEER that is sufficient to describe the universe. Second, we develop
an improved synthetic minority over-sampling technique (ISMOTE) that avoids the
over-fitting problem while efficiently balancing USEER. ISMOTE generates new objects
based on the average of the two objects with the smallest and largest distance from
the centroid object of the minority class. The experiments and analysis show that the
proposed IBPSO is feasible and effective, and that it outperforms other state-of-the-art
methods in minimizing the features, with an accuracy of 98.45%.
1. INTRODUCTION


The Surveillance, Epidemiology, and End Results (SEER) database is an open cancer database that
provides indicators for prognosis prediction across different cancers. It contains information about the
occurrence, frequency, survivability, and mortality of cancer. Cancer is typically labeled in stages from 1 to 4,
with 4 being the most serious. The information collected in SEER is mostly high-dimensional [2]. It is also
unbalanced, i.e., the objects in stages 2 and 3 far outnumber those in stages 1 and 4. Therefore,
the database is referred to as the unbalanced SEER (USEER) database. The two classes with the fewest
objects are referred to as minority classes, while the other two are referred to as majority classes. The
high dimensionality and the imbalance often hamper the early prediction of breast cancer and lead
to delayed and inaccurate results, which reduce the patient's chance of survival. Many recent research papers
address either the high dimensionality problem or the imbalance problem; the
motivation behind this work is to propose an approach that addresses both.
Data reduction is a data preprocessing technique that aims at preparing the data for prediction. Instead
of being overwhelmed with a huge amount of data, potentially causing many prediction errors, the
classifier has an easier job. Although the data shrinks, the fundamentals and integrity of the original data
should be retained. Data reduction decreases the processing time, storage space, and computational complexity.
Data reduction techniques include feature selection and instance selection. We employ both techniques.
The SEER database contains instances for many types of cancer, while the cancer of interest is breast cancer. In
this case, only the instances of breast cancer are selected. This is called instance selection.

Feature selection (FS) is an approach that enables prediction algorithms to be applied to
high-dimensional data with less computation [3]. FS proceeds in two steps: feature evaluation and feature set
search. Feature evaluation assesses each feature in the dataset separately in terms of its relevance to the class
variable. Feature set search, on the other hand, tries various combinations of the evaluated features to arrive at a
shortlist of features that sufficiently describes the objects [4]. Among the feature evaluation tools is
correlation-based feature selection (CFS) [5]. The strength of the CFS tool lies in its ability to find a feature
subset with features that are highly correlated with the class, yet uncorrelated with each other. The CFS tool measures the
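goodness $f(A)$ of a set $A$ of features ($A \subseteq \mathcal{A}$) as (1):

$$f(A) = \frac{|A|\,\bar{\rho}_{a_i c}}{\sqrt{|A| + |A|\,(|A|-1)\,\bar{\rho}_{a_i a_j}}} \qquad (1)$$

where $\bar{\rho}_{a_i c}$ is the average correlation between the features $a_i \in A$ and the class $c$,
$\bar{\rho}_{a_i a_j}$ is the average correlation between pairs of features in $A$, and $\rho_{a_i,a_j}$ is
Pearson's correlation coefficient between features $a_i$ and $a_j$, given by (2):

$$\rho_{a_i,a_j} = \frac{\mathrm{Cov}(a_i,a_j)}{\sigma_{a_i}\,\sigma_{a_j}} \qquad (2)$$

where $\mathrm{Cov}(a_i,a_j)$ is the covariance, which measures the strength of the correlation between the two
features $a_i$ and $a_j$, and $\sigma_{a_k}$ is the standard deviation of feature $a_k$.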
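As a rough illustration of the goodness measure, the sketch below computes a CFS-style merit in plain Python. It assumes the standard CFS form (average feature-class correlation in the numerator, average feature-feature correlation in the denominator), uses absolute Pearson correlations, and assumes no feature is constant; the function names `pearson` and `cfs_goodness` and the toy data are hypothetical, not taken from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation: Cov(x, y) / (sigma_x * sigma_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither vector is constant


def cfs_goodness(features, target):
    """CFS-style merit of a feature subset (assumed standard form)."""
    k = len(features)
    # Average absolute feature-class correlation.
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    # Average absolute feature-feature correlation (0 for a single feature).
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


target = [0, 0, 1, 1, 1, 0]
f1 = [1, 2, 8, 9, 7, 2]        # tracks the class closely
f2 = [v + 1 for v in f1]       # perfectly redundant copy of f1
single = cfs_goodness([f1], target)
redundant = cfs_goodness([f1, f2], target)
```

Under these assumptions, adding the redundant copy `f2` leaves the merit unchanged, which is the behavior the text describes: features that are correlated with each other do not improve the goodness of the subset.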

The hurdle is that the size of the search space grows exponentially with the number of
features: the CFS tool needs to assess $2^n$ feature subsets for USEER with $n$ features. Therefore, an
exhaustive CFS search is infeasible on USEER. Swarm intelligence (SI) algorithms [6] fill this gap by
searching for a feature subset at low cost. Some examples of these algorithms are the particle swarm optimization (PSO)
algorithm [7], genetic algorithm (GA) [8], ant colony optimization (ACO) algorithm [9], artificial bee colony
optimization (BCO) algorithm [10], bat search algorithm (BSA) [11], cuckoo optimization (CO) algorithm,
and elephant herding optimization (EHO) algorithm [13]. The first two are the core of the approach proposed
in the present article. The PSO algorithm is characterized by its simple operators, and it is computationally
inexpensive in terms of both memory and cost. PSO solves the FS problem by iteratively improving each
particle's position with respect to a given measure of quality. The particle position is represented by a pivot vector
pointing at a subset of features of the balanced SEER (BSEER). The PSO algorithm assumes a swarm of
$P \geq 10$ particles moving in the search space according to a simple mathematical formula, known as a velocity
function. The minimum number of particles is 10 because most swarms in nature have 10 particles on
average. In each iteration $t \geq 1$, the particle that achieves the highest performance, being closest to the food,
is referred to as the commander and the rest are slaves. The commander is chosen afresh in each iteration.
The commander guides the other particles to update their positions so that they converge towards the food. Therefore,
each particle $i = 1,2,\ldots,P$ in iteration $t+1$ updates its position towards a better position according to the
commander's position and the velocity function. At the end of the user-defined number of iterations $N$, this
exercise is expected to have moved the swarm toward the best solution, its food. The good thing about PSO is that
it makes no assumptions about the problem under study and can search the high-dimensional BSEER for a
minimal feature subset.

Initially, we consider that we have $P \geq 10$ particles. Each particle $i = 1,2,\ldots,P$ at iteration $t \geq 1$ has
its own pivot vector $X_{i,t} = [x_1, x_2, \ldots, x_n]$, where $x_j \in \{0,1\}$. All the particles start at iteration
$t = 1$ by pivoting randomly on a feature subset of the whole BSEER feature set. For each pivot vector $X_{i,t}$, we
construct its corresponding feature subset $A \subseteq \mathcal{A}$ by:

$$A = \{a_j \mid x_j = 1\},$$

which represents the set of features whose corresponding values in $X_{i,t}$ are 1. For example, $X_{1,t} =
[1,1,0,1,0]$ gives $A = \{a_1, a_2, a_4\}$.
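The mapping from a pivot vector to its feature subset can be sketched in a few lines of Python. The helper names `random_pivot` and `features_from_pivot` and the feature labels are illustrative assumptions only.

```python
import random


def random_pivot(n_features, rng=random):
    """Draw a random binary pivot vector, as done for each particle at t = 1."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def features_from_pivot(pivot, names):
    """Return the features whose corresponding bit in the pivot vector is 1."""
    return [name for bit, name in zip(pivot, names) if bit == 1]


names = ["a1", "a2", "a3", "a4", "a5"]
subset = features_from_pivot([1, 1, 0, 1, 0], names)
print(subset)  # ['a1', 'a2', 'a4'], matching the example in the text
```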
PSO computes the goodness $f(\mathcal{A})$ of the original set of features $\mathcal{A}$ in BSEER by (1). Then, at the
end of each iteration $t \geq 1$, each particle is assessed by computing the goodness of its corresponding
feature subset. The pivot vector with the highest corresponding goodness $f(A)$ is considered the commander
and is assigned to $\hat{X}_t$. Each slave particle $i = 1,2,\ldots,P-1$ updates its pivot vector $X_{i,t+1}$ in iteration
$t+1$ with respect to $\hat{X}_t$ in two steps. First, it computes its velocity $V_{i,t+1}$, by which the particle
updates its pivot vector at iteration $t+1$, given by:

$$V_{i,t+1} = V_{i,t} + c_1 f(\mathcal{A})\,\hat{X}_t - c_2 f(A)\,X_{i,t} \qquad (3)$$

where $c_1$ and $c_2$ are two positive constants with $c_1 + c_2 = 4$. All the components of the velocity vector
$V_{i,t}$ at $t = 1$ are 0. Second, each slave particle $i = 1,2,\ldots,P-1$ finds its updated pivot vector $X_{i,t+1}$
at iteration $t+1$:

$$X_{i,t+1} = X_{i,t} + V_{i,t+1} \qquad (4)$$
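A single update step can be sketched as follows, assuming the velocity update reads $V_{i,t+1} = V_{i,t} + c_1 f(\mathcal{A})\,\hat{X}_t - c_2 f(A)\,X_{i,t}$ and the position update reads $X_{i,t+1} = X_{i,t} + V_{i,t+1}$. The choice $c_1 = c_2 = 2$ and the goodness values are illustrative assumptions. The text does not say how the real-valued result is mapped back to a binary pivot vector, so the sketch returns the raw values.

```python
def update_particle(x, v, commander, f_full, f_subset, c1=2.0, c2=2.0):
    """One raw velocity/position step for a slave particle.

    x, v       current pivot vector and velocity of the particle
    commander  pivot vector of the commander
    f_full     goodness of the full feature set
    f_subset   goodness of the particle's current subset
    """
    new_v = [vi + c1 * f_full * ci - c2 * f_subset * xi
             for xi, vi, ci in zip(x, v, commander)]
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v


x = [1, 1, 0, 1, 0]
commander = [1, 0, 1, 1, 0]
new_x, new_v = update_particle(x, [0.0] * 5, commander, f_full=0.5, f_subset=0.25)
print(new_v)  # [0.5, -0.5, 1.0, 0.5, 0.0]
print(new_x)  # [1.5, 0.5, 1.0, 1.5, 0.0]
```

Components where the commander has a 1 and the particle a 0 receive the largest push, which is how the commander pulls the slaves toward its own subset.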


However, the PSO algorithm creates an undesirable feature subset $A \subseteq \mathcal{A}$ of USEER that is
insufficient to describe the universe [14]. This is because it employs the k-nearest neighbor (k-NN) classifier
as a fitness function, which is frustrated by the unbalanced nature of USEER. This issue is commonly known
as a local minima feature subset: given a feature subset $A \subseteq \mathcal{A}$, $A$ is said to be a local minima
feature subset if $f(A) < f(\mathcal{A})$ for all values in a specific interval but not over the whole domain.
Additionally, the PSO algorithm is considered a classifier-dependent algorithm, as the resultant feature subset
depends heavily on the accuracy of the k-NN classifier, which in turn may result in poor accuracy with other
classifiers. This calls for an improved PSO algorithm to deal with USEER, as utilized in the present work. GA is an
evolutionary algorithm that mimics the biological behavior of genes. In contrast to the PSO algorithm, GA employs a
cross-over technique that can update the feature subset found so far and avoids being trapped in local minima.

Definition 1 (Cross-over) Given two pivot vectors $Y = [1,0,0,1,1,1,0,1]$ and $Z = [0,1,1,0,0,1,1,0]$,
where 1 in the $i$-th position means that feature $a_i$ is selected and 0 otherwise, the cross-over technique randomly
chooses a position, and all bits beyond that position are swapped between the two vectors to generate two new
vectors. For example, if position 4 is chosen, the two new vectors are $Y' = [1,0,0,1,0,1,1,0]$
and $Z' = [0,1,1,0,1,1,0,1]$.
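Definition 1 corresponds to single-point cross-over, which can be sketched as below. The function name `crossover` and its `point` parameter are illustrative; when no position is given, one is drawn at random.

```python
import random


def crossover(y, z, point=None, rng=random):
    """Swap all bits beyond `point` between two equal-length pivot vectors."""
    if len(y) != len(z):
        raise ValueError("pivot vectors must have the same length")
    if point is None:
        point = rng.randint(1, len(y) - 1)  # random cut position
    return y[:point] + z[point:], z[:point] + y[point:]


Y = [1, 0, 0, 1, 1, 1, 0, 1]
Z = [0, 1, 1, 0, 0, 1, 1, 0]
Y_new, Z_new = crossover(Y, Z, point=4)
print(Y_new)  # [1, 0, 0, 1, 0, 1, 1, 0]
print(Z_new)  # [0, 1, 1, 0, 1, 1, 0, 1]
```

With the cut at position 4, the output reproduces the two vectors given in the definition.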

Rostami and Zadeh state that the unbalanced format of USEER negatively impacts the early
prediction of breast cancer. This is because prediction algorithms perform worse on minority
classes than on majority ones [16]. Attempts to mitigate the imbalance problem convert USEER
to a balanced database. This involves using an object sampling (OS) technique that aims to have objects evenly
distributed among classes. OS is classified into two groups: under-sampling and over-sampling. The former
proceeds by removing a set of objects from the majority classes, while the latter proceeds by generating a set of
objects in the minority classes. Under-sampling is characterized by its low cost. Over-sampling, on the contrary,
does not lose information, but it may result in an over-fitting problem, where the prediction algorithms fit to a
specific set of objects and yield poor prediction accuracy on previously unseen objects. The synthetic minority
over-sampling technique (SMOTE) [17], an example of over-sampling, has been used to balance
USEER. It randomly chooses an object from the minority decision class and finds its k nearest neighbor objects. Then
it generates a new object by interpolating between the chosen object and one of those neighbors. The process is
repeated until each class has an equal number of objects.

This article remedies the limitations of the FS and OS tasks on USEER. The
rest of this article is organized as follows. Section 2 covers the related work. Section 3 describes the proposed
approach. In section 4, the experimental work is carried out and discussions are given. Finally, the concluding
remarks are presented in section 5.


2. RELATED WORK 

Zhao et al. introduce a predictive model for USEER using univariate and multivariate linear
regression (LR). They aim to predict the patient's cancer stage using the features age, race, tumor size, primary
site, pathological grade, histologic type, and molecular subtype. However, other authors state that social features
are increasingly emphasized in breast cancer progression. Therefore, they introduce a predictive model for USEER
to assess the impact of marital status on breast cancer. Furthermore, they used a chi-square method in [20] to
analyze the associations between marital status and other features and a Kaplan-Meier method to estimate
survival curves. By and large, the models mentioned above result in low accuracy with the prediction algorithms.
This is because they do not consider the unbalanced classes, the main characteristic of USEER.

The OS technique
has been the topic of much research in recent years to alleviate the unbalanced nature of USEER. Bertorello and
Koh use a density-based synthetic minority over-sampling (DSO) method to balance USEER. They assign
different weights to objects in the minority classes and then generate new objects with respect to the objects with
the highest weight. On the contrary, Luo et al. state that sampling from the objects with the least weight is better
for avoiding misclassification. Tao et al. propose a new over-sampling technique referred to as
self-organizing map over-sampling (SOMO) to balance USEER. The SOMO technique generates new objects
by producing a 2D representation of the input objects in the minority class and then averaging the closest objects.
Wang combines the strengths of the PSO algorithm and the CFS tool with two synthetic over-sampling methods,
borderline-SMOTE and DSO, together with the Bayesian network (BN) algorithm and LR. Mirjalili et al. examine 11
over-sampling techniques and 7 under-sampling techniques on 15 types of cancer. According to the study,
USEER degrades the performance of classifiers, and balancing methods enhance the classification
of USEER. Han et al. introduce a distribution-sensitive over-sampling technique for balancing USEER.
They divide the objects into noise, unstable, boundary, and stable objects according to their location in the
minority class. They use a set of different methods to assess which objects are suitable for generating new objects.
Anupama and Jena introduce the increment over-sampling for data streams (IOSDS) algorithm, which uses a unique
over-sampling technique to almost balance USEER. The IOSDS algorithm identifies noisy and mostly misclassified
objects from the majority and minority classes by employing a k-NN classifier. Then, it generates new objects
in the minority classes using artificial, replication, and hybrid objects. The trouble with the above-mentioned
attempts is that they are sequential in nature, resulting in a delayed prediction.

Tarkhaneh and Shen introduce a Mantegna Lévy flight PSO and neighborhood search (LPSONS) algorithm to reduce
the dimensionality of the USEER. They combine the strength of the velocity function of the PSO algorithm with a
Mantegna Lévy distribution function. This formulation leads to a more diverse feature subset. Additionally, to avoid
being trapped in local minima, they combine the strengths of both a neighborhood search algorithm and a Mantegna Lévy
distribution function. Pashaei et al. introduce a binary version of the PSO (BPSO) algorithm to avoid being
trapped in local minima. Then, they combine the strengths of both the BPSO algorithm and a binary black
hole optimization (BBHO) algorithm to improve the exploration and exploitation steps of the BPSO algorithm.
Afterward, they build a predictive model using a k-NN classifier to predict the patient's cancer stage early. The
above attempts have one thing in common: they create an undesirable feature subset of USEER because they do
not consider the unbalanced nature of USEER, a gap that is filled by the present work.

Fernández-Delgado et al. evaluate 179 classifiers from 17 different families, including Bayesian, neural networks,
random forests (RF), and logistic regression. The RF classifier is at the forefront of the best classifiers.
Ganggayah et al. build prediction models using decision tree (DT), neural networks, support vector machine (SVM),
RF, and logistic regression algorithms to detect the significant indicators of breast cancer. The results identify
the cancer stage as one of the most important indicators. The accuracies were close, with the lowest obtained from DT
and the highest from RF. Another study analyzed breast cancer at an early stage by comparing the
performance of DT, RF, and SVM. The results show that RF performs better than the other techniques
for predicting cancer at an early stage. This article circumvents these problems by introducing an improved
SMOTE to avoid the over-fitting problem and an improved PSO algorithm to avoid getting trapped in
local minima while dealing with the unbalanced SEER.


3. RESEARCH METHOD 


As shown in Figure 1, the improved balancing particle swarm optimization (IBPSO) in this work conducts
the improvement in two main directions: i) feature selection using an improved PSO algorithm (IPSO) and
ii) balancing USEER using an improved SMOTE (ISMOTE). The first direction is FS, which consists of two
main steps: feature evaluation using CFS and feature set search by IPSO. First, CFS evaluates the relevance
between each feature in the database and the class variable to find the most strongly associated features. Second,
IPSO tries different combinations of the evaluated features to develop the best shortlist of features that adequately
describes the objects.

Accordingly, the IPSO algorithm is designed as shown in Algorithm 1. We try different numbers of particles to pick
the number that best fits the problem, increasing it until the result stops improving. The IPSO algorithm tries a
swarm of $10 \leq P \leq 50$ particles. IPSO uses the goodness function, given by (1), to assess the goodness of the
selected feature subset in each iteration. When $t = 1$, IPSO calculates the fitness value of the given BSEER and calls it
B_goodness (best goodness). The algorithm then aims to search for a minimal feature subset having the
same B_goodness. A simple loop iterates $P$ times to randomly initialize the $P$ pivot vectors of the $P$ particles
and calculate the goodness of each corresponding feature subset $A$. The pivot vector with the highest fitness value is
assigned to $\hat{X}_t$. Afterward, IPSO employs the cross-over ability of GA to update the pivot vectors of the $P$
particles and iterates until a user-defined number of iterations $N$ is reached or a minimal feature subset with
B_goodness is found.


[Figure 1: flowchart. SEER dataset → feature evaluation by CFS → feature set search by IPSO → data balancing by ISMOTE → model building → validation → evaluation]

Figure 1. Process flow for the proposed IBPSO algorithm


The second direction is data balancing using ISMOTE, which starts by finding the centroid of the
minority decision class. Then, it computes the Euclidean distance between each object in the class and the
centroid object. Finally, the newly generated object is the average of the two objects having the
smallest and the largest distance from the centroid. This step is repeated until we have a BSEER.
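The ISMOTE step described above can be sketched in plain Python as follows. Two details are assumptions of this sketch rather than statements from the text: each synthetic object is added to the minority class and the centroid is recomputed before the next object is generated, and the caller supplies the target class size. The names `centroid`, `ismote_step` and `ismote` are hypothetical.

```python
import math


def centroid(objects):
    """Feature-wise mean of the objects in the minority class."""
    n = len(objects)
    return [sum(column) / n for column in zip(*objects)]


def ismote_step(objects):
    """Average the objects nearest to and farthest from the class centroid."""
    c = centroid(objects)
    nearest = min(objects, key=lambda o: math.dist(o, c))   # Euclidean distance
    farthest = max(objects, key=lambda o: math.dist(o, c))
    return [(a + b) / 2 for a, b in zip(nearest, farthest)]


def ismote(minority, target_size):
    """Generate synthetic objects until the class reaches target_size."""
    objects = [list(o) for o in minority]
    while len(objects) < target_size:
        # Assumption: the new object joins the class before the next step.
        objects.append(ismote_step(objects))
    return objects


minority = [[1, 1], [2, 0], [0, 3], [9, 8]]
first = ismote_step(minority)      # centroid is (3, 3)
balanced = ismote(minority, 6)
print(first)  # [5.0, 4.5]: average of (1, 1) and (9, 8)
```

Because each synthetic object lies midway between an inner and an outer object of the class, it always falls inside the region already spanned by the minority class.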


Algorithm 1 Improved particle swarm optimization (IPSO) algorithm


Input: BSEER
       P, number of particles ($10 \leq P \leq 50$)
       N, number of iterations ($N \geq 2$)
Output: A, a minimal BSEER feature subset ($A \subseteq \mathcal{A}$)
t := 1
B_goodness := $f(\mathcal{A})$ as per (1)
for i = 1 to P do
    Construct randomly the pivot vector $X_{i,t}$
    A := $\{a_j \mid x_j = 1\}$, the set of features whose corresponding values in $X_{i,t}$ are 1
    Calculate f(A) as per (1)
end for
Assign the pivot vector with the highest corresponding f(A) value to $\hat{X}_t$
for k = 2 to N do
    t := t + 1
    Cross-over the two slave pivot vectors with the highest corresponding goodness as per Definition 1
    for i = 1 to P do
        Calculate the pivot vector $X_{i,t}$ from $X_{i,t-1}$ as per (4)
        A := $\{a_j \mid x_j = 1\}$
        Calculate f(A) as per (1)
    end for
    Assign the pivot vector with the highest corresponding f(A) value to $\hat{X}_t$
    if f(A) = B_goodness then
        break
    end if
end for
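The control flow of Algorithm 1 can be illustrated with the simplified Python sketch below. It is not the authors' implementation: it keeps the random initialization, the commander selection, the cross-over of the two best slaves and the early stop, but it omits the velocity-based update. It also assumes that the two children replace the two worst particles, and it uses a hypothetical toy goodness function in place of the CFS measure. The names `ipso_sketch`, `toy_goodness` and `RELEVANT` are illustrative.

```python
import random


def ipso_sketch(n_features, goodness, target, P=20, N=50, seed=0):
    """Cross-over-only simplification of Algorithm 1 (velocity step omitted)."""
    if P < 5:
        raise ValueError("this sketch needs P >= 5 so children never overwrite parents")
    rng = random.Random(seed)
    # Random initial pivot vectors, one per particle.
    swarm = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(P)]
    for _ in range(N):
        # Best particle first: swarm[0] is the commander, the rest are slaves.
        swarm.sort(key=goodness, reverse=True)
        if goodness(swarm[0]) >= target:
            break  # a subset reaching the target goodness was found
        # Cross over the two best slaves at a random position (Definition 1).
        point = rng.randint(1, n_features - 1)
        y, z = swarm[1], swarm[2]
        # Assumption: the two children replace the two worst particles.
        swarm[-2] = y[:point] + z[point:]
        swarm[-1] = z[:point] + y[point:]
    return max(swarm, key=goodness)


# Toy goodness (hypothetical): fraction of bits agreeing with a hidden mask.
RELEVANT = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_goodness(pivot):
    return sum(int(p == r) for p, r in zip(pivot, RELEVANT)) / len(RELEVANT)


best = ipso_sketch(len(RELEVANT), toy_goodness, target=1.0)
```

Since the commander is never overwritten, the best goodness in the swarm cannot decrease from one iteration to the next in this sketch.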


4. RESULT AND DISCUSSION 

The experiments are conducted on SEER 1973-2016. SEER consists of 10,050,814 observations
for all cancer types, of which only 1,631,572 cases were diagnosed with breast cancer. From this population, we
exclude 1,383,910 cases whose cause of death is not breast cancer. We further exclude 93,321 cases with an unknown stage.
Due to the impact of hurricane Katrina, 216 Louisiana cases diagnosed in that six-month period are excluded
from the research database. We also exclude 280 cases without active follow-up, i.e., cases where contact
with the patient for vital status was not maintained, and 2,770 cases that were not malignant cancers. The final cohort
in our study consists of 151,075 cases with 160 variables. Table 1 shows the number of instances in the different
stages in BSEER before and after applying ISMOTE. The total number of instances after balancing with ISMOTE
is 177,173. The final number of features selected by IPSO is 36. The selected features are survival
months, first malignant primary indicator, total number of in situ/malignant tumors for a patient, radiation
recode, chemotherapy recode, radiation sequence with surgery, laterality, histology, regional nodes positive,
breast subtype, SEER cause-specific death classification, primary site, grade, tumors of adolescents and young
adults site recode, breast-adjusted N (refers to the number of nearby lymph nodes that have cancer),
breast-adjusted T (refers to the size and extent of the main tumor), scope of regional lymph node surgery (describes
the performed procedure of removal, biopsy, or aspiration of regional lymph nodes), surgery of primary site
(describes a surgical procedure that removes and/or destroys the tissue of the primary site),
Appalachia, CS schema-AJCC 6th edition, Indian health service files to identify Native Americans, Louisiana,
month of diagnosis recode, Hispanic identification algorithm (used to classify cases as Hispanic or not), record
number (unique sequential number for each patient that identifies the number of records submitted to SEER for
that particular patient), SEER registry (used in conjunction with Patient ID to uniquely identify a patient),
Site - mal+ins (mid detail) (which should be used in conjunction with, and only with, the variables site specific
(SS) sequence mal+ins (mid detail), SS sequence 1975+ mal+ins (mid detail), or SS sequence 1992+ mal+ins
(mid detail), which are already selected by our algorithm), SS sequence 1975+ - mal (most detail), SS
sequence 1992+ mal (most detail), SS sequence 1992+ mal+ins (most detail), and year of birth. Table 2 shows
the performance measures of the IBPSO, the results without ISMOTE, and the results without IPSO.


Table 1. Description of BSEER used in the experiments

Class     Before ISMOTE (percentage)   After ISMOTE (percentage)
Stage 1   9973 (6.60%)                 19975 (11.3%)
Stage 2   51423 (34.04%)               51423 (29.0%)
Stage 3   83581 (55.32%)               83581 (47.2%)
Stage 4   6098 (4.04%)                 22194 (12.5%)
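The cohort size and the class shares can be cross-checked with a few lines of arithmetic. The sketch below takes the stage 2 count before ISMOTE as 51,423, the value implied by the reported 34.04% share of a 151,075-case cohort; the dictionary keys and the helper `shares` are illustrative names.

```python
breast_cases = 1_631_572
exclusions = {
    "cause of death not breast cancer": 1_383_910,
    "unknown stage": 93_321,
    "Louisiana cases (hurricane Katrina period)": 216,
    "no active follow-up": 280,
    "not malignant": 2_770,
}
cohort = breast_cases - sum(exclusions.values())

before = {"stage 1": 9_973, "stage 2": 51_423, "stage 3": 83_581, "stage 4": 6_098}
after = {"stage 1": 19_975, "stage 2": 51_423, "stage 3": 83_581, "stage 4": 22_194}


def shares(counts):
    """Percentage share of each class, rounded to two decimals."""
    total = sum(counts.values())
    return {stage: round(100 * n / total, 2) for stage, n in counts.items()}


generated = sum(after.values()) - sum(before.values())
print(cohort)          # 151075
print(generated)       # 26098 synthetic objects, all in stages 1 and 4
print(shares(before))  # stage 2 is about 34.04%
print(shares(after))   # stage 4 rises to about 12.53%
```

The exclusions reproduce the 151,075-case cohort, and the class counts after balancing sum to 177,173, so 26,098 of those instances are synthetic.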


The best number of particles for our problem was 20. To validate the feature subset
selected by IBPSO, we compare the results of IBPSO with five related SI algorithms, namely ACO, BCO, CO,
BSA, and EHO, using 10-fold cross-validation, as shown in Table 2 and Table 3. IBPSO attains the highest
accuracy of the compared algorithms (98.45%, against 94.05% for ACO, the closest competitor) while
selecting 36 features, compared with 43 for ACO.


Table 2. Performance measures for the IBPSO

Evaluation measure   IBPSO    Without ISMOTE   Without IPSO
Accuracy             98.45%   64.79%           70.8%
Recall               0.985    0.648            0.708
Precision            0.985    0.652            0.714
F-Measure            0.984    0.621            0.694
MCC                  0.974    0.343            0.489
ROC area             0.987    0.774            0.812
PRC area             0.986    0.706            0.724
MAE                  0.0934   0.2255           0.2212
RMSE                 0.1579   0.3351           0.325


To illustrate the stability of the IPSO algorithm, Figure 2 shows the receiver operating characteristic
(ROC) curve for the four cancer stages. The ROC curve measures the classification algorithm's performance as the
relation between classifier sensitivity and specificity at different thresholds. Classifier sensitivity represents the
true positive rate, while specificity represents the true negative rate. The farther the curve is above the diagonal
line, the higher the overall accuracy of the model.
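Each point on a ROC curve comes from one decision threshold, as the sketch below shows for a single class. It assumes a one-vs-rest reading of the four-stage problem, with label 1 marking the stage of interest; the function `roc_point` and the scores and labels are made-up illustrations, not results from the experiments.

```python
def roc_point(scores, labels, threshold):
    """Return (1 - specificity, sensitivity) for one class at one threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return 1 - specificity, sensitivity


scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
curve = [roc_point(scores, labels, th) for th in (0.2, 0.5, 0.7)]
```

Lowering the threshold moves the point up and to the right: at 0.2 every positive is found (sensitivity 1) at the price of two false positives out of three negatives, while at 0.7 there are no false positives but one positive is missed.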
There are several problems with the SEER database:
- There are many blank or unknown fields. Unfortunately, excluding every record with blank or unknown data
leads to an empty matrix, so it is impossible to remove all of them. Therefore, we remove blank and
unknown fields only from the target features.
- Many features are gathered only in a specific period, which leads to inconsistency in the data. All these
features are eliminated from our study. We select only data collected after 2010.

Some data in our study were missing, but SEER encodes the missing data; therefore, algorithms
may be misled into treating it as complete data. So, our further work will include preprocessing
techniques for missing data other than deletion, as well as an approach for optimizing the
model's hyperparameters.


Table 3. Comparative experiment results of different SI algorithms

Evaluation measure    BSA      CO       ACO      BCO      EHO
# selected features   33       30       43       27       37
Accuracy              69.03%   70.74%   94.05%   64.36%   64.83%
Recall                0.69     0.707    0.941    0.644    0.648
Precision             0.694    0.713    0.941    0.663    0.654
F-Measure             0.672    0.694    0.94     0.604    0.625
MCC                   0.455    0.488    0.899    0.362    0.371
ROC area              0.775    0.812    0.991    0.775    0.765
PRC area              0.669    0.728    0.986    0.689    0.684
MAE                   0.2383   0.2208   0.0965   0.2454   0.228
RMSE                  0.3364   0.3248   0.1797   0.3417   0.3376


[Figure 2: ROC plot; x-axis: 1-Specificity, y-axis: Sensitivity]

Figure 2. ROC curve for the four cancer stages


5. CONCLUSION 

The proposed approach, IBPSO, is designed principally to process USEER data and predict the cancer stage.
IBPSO conducts the improvement in two directions. First, we design and implement an IPSO algorithm that avoids
being trapped in local minima while reducing the dimensionality of the USEER data. The improvement comes primarily
from employing the cross-over ability of the genetic algorithm (GA) as a fitness function while using the
correlation-based function to guide the selection task in the IPSO algorithm. This idea leads to a minimal feature
subset of USEER that is sufficient to describe the universe. Second, we develop an ISMOTE that avoids the over-fitting
problem while efficiently balancing USEER. ISMOTE generates new objects based on the average of the two
objects with the smallest and largest distance from the centroid object of the minority class. The results show
that IBPSO outperforms the related algorithms in finding a minimal feature subset, with an accuracy of 98.45%.
The classification accuracy of IBPSO is promising and superior to that achieved with the other methods.